Understanding the im2col implementation in Python (NumPy fancy indexing)


I came across the following im2col implementation in Python in Assignment 2 of CS231n:

import numpy as np

def get_im2col_indices(x_shape, field_height, field_width, padding=1, stride=1):
    # First figure out what the size of the output should be
    N, C, H, W = x_shape
    assert (H + 2 * padding - field_height) % stride == 0
    assert (W + 2 * padding - field_width) % stride == 0
    out_height = (H + 2 * padding - field_height) // stride + 1
    out_width = (W + 2 * padding - field_width) // stride + 1

    i0 = np.repeat(np.arange(field_height), field_width)
    i0 = np.tile(i0, C)
    i1 = stride * np.repeat(np.arange(out_height), out_width)
    j0 = np.tile(np.arange(field_width), field_height * C)
    j1 = stride * np.tile(np.arange(out_width), out_height)
    i = i0.reshape(-1, 1) + i1.reshape(1, -1)
    j = j0.reshape(-1, 1) + j1.reshape(1, -1)

    k = np.repeat(np.arange(C), field_height * field_width).reshape(-1, 1)

    return (k, i, j)

def im2col_indices(x, field_height, field_width, padding=1, stride=1):
    """ An implementation of im2col based on some fancy indexing """
    # Zero-pad the input
    p = padding
    x_padded = np.pad(x, ((0, 0), (0, 0), (p, p), (p, p)), mode='constant')

    k, i, j = get_im2col_indices(x.shape, field_height, field_width, padding,
                                 stride)

    cols = x_padded[:, k, i, j]
    C = x.shape[1]
    cols = cols.transpose(1, 2, 0).reshape(field_height * field_width * C, -1)
    return cols

(Two small fixes compared with the version I first saw: the second assert should use field_width rather than field_height, and the output sizes need integer division // so they can be used as array lengths in Python 3.)
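To sanity-check what im2col produces, here is a naive, loop-based version (my own sketch, not part of the assignment) that extracts every receptive field with explicit loops. It produces the same shape as the fancy-indexing version, though the column ordering over the batch dimension differs (this version groups columns by image):

```python
import numpy as np

def im2col_naive(x, FH, FW, padding=1, stride=1):
    # Extract each FH x FW receptive field (across all C channels)
    # and stack it as one column of the output matrix.
    N, C, H, W = x.shape
    p = padding
    x_padded = np.pad(x, ((0, 0), (0, 0), (p, p), (p, p)), mode='constant')
    out_h = (H + 2 * p - FH) // stride + 1
    out_w = (W + 2 * p - FW) // stride + 1
    cols = np.zeros((C * FH * FW, N * out_h * out_w))
    col = 0
    for n in range(N):
        for i in range(out_h):
            for j in range(out_w):
                patch = x_padded[n, :, i*stride:i*stride+FH, j*stride:j*stride+FW]
                cols[:, col] = patch.reshape(-1)
                col += 1
    return cols

x = np.arange(2 * 3 * 4 * 4).reshape(2, 3, 4, 4).astype(float)
cols = im2col_naive(x, 3, 3, padding=1, stride=1)
print(cols.shape)  # (27, 32): C*FH*FW = 27 rows, N*out_h*out_w = 2*4*4 = 32 columns
```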

Even though I have used Python for three years, it was still mind-blowing to understand how this works at first. To understand the mechanism of this function, it's time to learn more about fancy indexing in NumPy.

When we implement backpropagation for a neural network, we might see something like this to compute the final loss:

loss = some_loss_fun(scores[np.arange(N), target_y])

We know it extracts the elements at the corresponding positions; the output will have shape (N,). At first you may treat it as a small trick to memorize and not care what's going on behind it. But once you look at something like the following, you will see this kind of indexing can do things far beyond your expectations (I also had no idea what it would give me before I ran this code):
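A minimal sketch of that loss-indexing pattern with made-up numbers (the scores and target_y values here are hypothetical):

```python
import numpy as np

# Hypothetical class scores: 4 examples, 3 classes.
scores = np.array([[0.1, 0.5, 0.4],
                   [0.8, 0.1, 0.1],
                   [0.2, 0.2, 0.6],
                   [0.3, 0.3, 0.4]])
target_y = np.array([1, 0, 2, 0])  # correct class index for each example

N = scores.shape[0]
# Pairs up row i with column target_y[i]: one score per example.
correct_scores = scores[np.arange(N), target_y]
print(correct_scores)  # [0.5 0.8 0.6 0.3], shape (N,)
```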

In [59]: k = np.arange(24).reshape(2,3,4)

In [60]: k
Out[60]:
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])

In [62]: k[:, np.arange(2), np.arange(2).reshape(-1,1) * np.arange(2)]
Out[62]:
array([[[ 0,  4],
        [ 0,  5]],

       [[12, 16],
        [12, 17]]])

Actually, with this kind of fancy indexing, NumPy first tries to broadcast the index arrays to the same shape. A small caveat here: broadcasting always aligns trailing dimensions first. In our example, after broadcasting, the np.arange(2) in the second dimension becomes the left array below, not the right one.

[[0, 1],   # right!        [[0, 0],   # wrong!
 [0, 1]]                    [1, 1]]

NumPy then extracts the corresponding elements, as we might expect, and the result matches the output we ran above.
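We can also ask NumPy to show us the broadcast index arrays directly with np.broadcast_arrays, which confirms the "left one" above:

```python
import numpy as np

# The two index arrays from the example above, before broadcasting:
rows = np.arange(2)                                # shape (2,)
cols = np.arange(2).reshape(-1, 1) * np.arange(2)  # shape (2, 2)

# Trailing dimensions are aligned first, so the (2,) array is treated
# as one row and repeated downwards.
b_rows, b_cols = np.broadcast_arrays(rows, cols)
print(b_rows)
# [[0 1]
#  [0 1]]
print(b_cols)
# [[0 0]
#  [0 1]]
```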

While writing this, I also found an interesting example in the NumPy documentation that demonstrates the magic of broadcasting. If you are interested, see the NumPy docs on broadcasting.

A      (4d array):  8 x 1 x 6 x 1
B      (3d array):      7 x 1 x 5
Result (4d array):  8 x 7 x 6 x 5
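That shape arithmetic can be verified directly (the array shapes are the ones from the docs example; the contents are just zeros):

```python
import numpy as np

# Two arrays whose shapes only have to be compatible dimension-by-dimension,
# aligned from the trailing end; size-1 dimensions stretch to match.
A = np.zeros((8, 1, 6, 1))
B = np.zeros((7, 1, 5))
print((A + B).shape)  # (8, 7, 6, 5)
```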

Besides, we need to see that : is somewhat special in indexing, which means the outcomes of these two pieces of code will be different.

In [77]: k[:, :, np.arange(2).reshape(-1,1) * np.arange(2)]  # This will give an array of shape (2, 3, 2, 2)
Out[77]:
array([[[[ 0,  0],
         [ 0,  1]],

        [[ 4,  4],
         [ 4,  5]],

        [[ 8,  8],
         [ 8,  9]]],


       [[[12, 12],
         [12, 13]],

        [[16, 16],
         [16, 17]],

        [[20, 20],
         [20, 21]]]])

In [78]: k[:, np.arange(3), np.arange(2).reshape(-1,1) * np.arange(2)]  # This will give you an error because shapes (3,) and (2,2) cannot be broadcast together

With this thorough understanding of NumPy fancy indexing, we can go back and look at the intricate implementation of im2col.

We first need to examine the shapes of the three index arrays returned:

k is of shape (C * field_height * field_width, 1)
i is of shape (C * field_height * field_width, out_height * out_width)
j is of shape (C * field_height * field_width, out_height * out_width)

So when we index with x_padded[:, k, i, j], we extract a 3-D array of shape (N, C * field_height * field_width, out_height * out_width); the broadcast index arrays all have shape (C * field_height * field_width, out_height * out_width). Remember that each neuron of a filter in a CNN generally has C * field_height * field_width weights on it.

Now it might be clearer what's going on here. We will take the first column of the index arrays, of shape (C * field_height * field_width, 1), which gives you the receptive field of the first neuron, to walk you through the whole process. There will be out_height * out_width neurons for a single filter.
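These shapes can be checked by rebuilding the index arrays for a small, assumed configuration (C=3, a 3x3 field, a 4x4 input, padding 1, stride 1; the construction mirrors get_im2col_indices above):

```python
import numpy as np

C, FH, FW = 3, 3, 3
H = W = 4
padding, stride = 1, 1
out_h = (H + 2 * padding - FH) // stride + 1  # 4
out_w = (W + 2 * padding - FW) // stride + 1  # 4

# Within-field offsets (i0, j0) and per-neuron offsets (i1, j1),
# combined by broadcasting a column against a row.
i0 = np.tile(np.repeat(np.arange(FH), FW), C)
i1 = stride * np.repeat(np.arange(out_h), out_w)
j0 = np.tile(np.arange(FW), FH * C)
j1 = stride * np.tile(np.arange(out_w), out_h)
i = i0.reshape(-1, 1) + i1.reshape(1, -1)
j = j0.reshape(-1, 1) + j1.reshape(1, -1)
k = np.repeat(np.arange(C), FH * FW).reshape(-1, 1)

print(k.shape, i.shape, j.shape)  # (27, 1) (27, 16) (27, 16)
```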

I will refer to the second, third, and fourth dimensions of x (the image) as the C, H, and W positions.

  1. The index in the C position traverses the different channels; for each channel, field_height * field_width numbers are used, which indicates how many pixels each neuron captures in a single channel. The index will look like this:

\[ \underbrace{\underbrace{c,\dots,c}_{length\ of\ F\_H * F\_W } }_{c\ =\ 0,\dots,C-1} \]

  2. The index in the H position traverses the height within each neuron's receptive field. Because the implementation walks along the width side first, the index looks like:

\[ \underbrace{\underbrace{0,\dots,0}_{length\ of\ F\_W},\dots, \underbrace{F\_H-1,\dots,F\_H-1}_{length\ of\ F\_W} }_{rep\ for\ C\ times} \]

  3. Accordingly, the index in the W position looks like this:

\[ \underbrace{\underbrace{0,1, \dots,F\_W-1,\dots,0,1, \dots,F\_W-1}_{rep\ for\ F\_H\ times}}_{rep\ for\ C\ times} \]

The i1 and j1 in the function get_im2col_indices are used to adjust the height and width index positions for the different neurons. Because we also traverse the width side first for all neurons in each filter, we use np.tile for j1 (the width/column offsets) and np.repeat for i1 (the height/row offsets).
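A tiny sketch of that repeat/tile distinction, with an assumed 2x3 grid of output positions (out_h=2, out_w=3, stride=1):

```python
import numpy as np

out_h, out_w, stride = 2, 3, 1
# Height offsets: each value repeats out_w times, since we finish a whole
# row of output positions before moving down.
i1 = stride * np.repeat(np.arange(out_h), out_w)
# Width offsets: the 0..out_w-1 pattern tiles out_h times.
j1 = stride * np.tile(np.arange(out_w), out_h)
print(i1)  # [0 0 0 1 1 1]
print(j1)  # [0 1 2 0 1 2]
```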

By the way, NumPy operates in row-major (C) order by default, filling trailing dimensions first. For instance, reshape will fill or read out the rows of a 2-D array first:

In [84]: np.array(list(range(12))).reshape(3,4)
Out[84]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

# instead of
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])

Bonus part: some NumPy tricks I found accidentally:

  1. You can use the index None to add one more dimension to your np.array with ease. But be careful: the newly created array is a view of the original one, so modifying one will change the other!
In [91]: x.shape
Out[91]: (3, 6)

In [92]: x[None].shape
Out[92]: (1, 3, 6)
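A quick demonstration that x[None] really is a view rather than a copy (a minimal sketch):

```python
import numpy as np

x = np.zeros((3, 6))
y = x[None]          # shape (1, 3, 6): same data, one extra leading axis
y[0, 0, 0] = 42.0    # write through the view...
print(x[0, 0])       # ...and the original array changes too: 42.0
```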

(To be continued)


If you find any mistakes in this passage, feel free to contact me through my email: FDSM_lhn@yahoo.com.

You're welcome to share my passage to other websites or bloggers. But please add my name and link to this post alongside. Thank you!

Haonan Li